Abstract

We explore the effects of over-specificity in learning algorithms by investigating the behavior
of a student, suited to learn optimally from a teacher B, learning from a teacher B' ≠ B. We
only considered the supervised, on-line learning scenario with teachers selected from a particular
family. We found that, in the general case, the application of the optimal algorithm to the wrong
teacher produces a residual generalization error, even if the right teacher is harder. By imposing
mild conditions on the form of the learning algorithm we obtained an approximation for the residual
generalization error. Simulations carried out in finite networks validate the estimate found.

PACS numbers: 89.70.Eg, 84.35.+i, 87.23.Kg

I. INTRODUCTION



Neural networks are connectionist models inspired by the dynamical behavior of the brain
(author?). They are not only theoretically interesting models; they can also be used in
a number of applications, from voice recognition systems to curve fitting software. Probably
the properties that make neural networks most useful are their potential to store patterns
and their capability for learning tasks.

One of the best-studied types of network is the feed-forward network. What characterizes a
feed-forward network is that the flux of information follows a non-loopy path from input
to output nodes, making the information processing much faster. Perceptrons (author?)
are feed-forward networks with no internal nodes and only one output; they have been
used in a number of theoretical studies and applications of statistical mechanics techniques
(author?). In particular, the knowledge of Hebbian learning algorithms in an on-line
scenario is quite complete.

In the present article we study the ability of a student J, using an algorithm for learning 
optimally from a specific teacher B, to learn from a teacher B'. If a student is adapted to 
learn from a difficult teacher, it is not unreasonable to expect that it will be able to learn 
from an easier one. To formally analyze this problem we need to quantify the hardness 
of the teachers, set up the scenario where the learning process would take place and thus 
quantify the student's performance. 



Attempts to quantify hardness as an inherent property of the observed object have given
origin to many formal definitions of complexity (author?). Recently (author?)
L. Franco has proposed to quantify a (Boolean) function's hardness by the size of the
minimal set of examples needed to train a feed-forward network, with a predetermined
architecture, until reaching zero prediction error. He also found (author?) [10] that in
this minimal set there are many pairs of examples that, although differing only in a finite
number P = 1, 2, ... of entries, have different outputs, implying that these examples
are located on either side of the classification boundary (similar to the support vectors for
SVMs (author?)). Further investigation showed that the average discrepancy of the
function's outputs (measured over neighboring pairs) is correlated with the generalization ability
of the network implementing the function. In order to circumvent the use of the neural network
and its minimal training set, Franco proposed to use the average distance sensitivity directly
as a measure of the function's hardness. This is probably the most suitable measure for our
study given that the nature of the measure itself is linked to the concept of generalization
ability.

The hardness measure we will use is the average output discrepancy taken over all pairs of
inputs at a given Hamming distance P. Formally, for a given Boolean function
$f : \{\pm 1\}^N \to \{\pm 1\}$, the Pth distance sensitivity component $\partial_P[f]$ is the functional

$$\partial_P[f] = \frac{1}{2} - \frac{1}{2^{N+1}\binom{N}{P}} \sum_{S \in \{\pm 1\}^N} \; \sum_{S' \in \Omega_P(S)} f(S)\, f(S'), \qquad (1)$$

where $\Omega_P(S) = \{ S' \in \{\pm 1\}^N \mid \sum_{j=1}^{N} \Theta(-S_j S'_j) = P \}$, i.e. $\Omega_P(S)$ is the set of inputs S' that
differ from S in P entries.
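The definition above can be checked by brute force on small inputs. The following is a minimal sketch, assuming the sensitivity is the plain fraction of input pairs at Hamming distance P whose outputs disagree; the function names `distance_sensitivity`, `parity` and `first_bit` are illustrative and do not come from the paper.

```python
from itertools import combinations, product
from math import comb


def distance_sensitivity(f, n_bits, p_dist):
    """Fraction of pairs (S, S') at Hamming distance p_dist with f(S) != f(S')."""
    disagreements = 0
    for s in product((-1, 1), repeat=n_bits):
        # Every S' in Omega_P(S) is obtained by flipping p_dist entries of S.
        for flipped in combinations(range(n_bits), p_dist):
            s_prime = list(s)
            for j in flipped:
                s_prime[j] = -s_prime[j]
            if f(s) != f(tuple(s_prime)):
                disagreements += 1
    return disagreements / (2 ** n_bits * comb(n_bits, p_dist))


def parity(s):
    """Product of all entries: flips sign under any odd number of bit flips."""
    result = 1
    for x in s:
        result *= x
    return result


def first_bit(s):
    """Perceptron that reads a single input bit (dilution m = 1)."""
    return s[0]
```

For the one-bit function the result equals P/N, since only flips that hit the relevant bit change the output; parity is maximally sensitive at odd P and insensitive at even P.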

Dilution gives rise to networks with fewer connections, which can be more efficient in
solving tasks and can be more easily implemented in hardware. Diluted perceptrons have
been widely studied using statistical mechanics techniques (author?) [13], [14], [15], [16]
and have also been studied as an approximation to more difficult Boolean functions
(author?) [19], [20]. Probably the most important features of diluted perceptrons related to
the present work are the existence of analytical expressions for the sensitivity component
(1) and for the associated optimal learning algorithm (see below).

Consider a perceptron characterized by a synaptic vector $B^{(m)} \in \mathbb{R}^N$ that classifies binary
vectors $S \in \{\pm 1\}^N$ with labels $\sigma_B \in \{\pm 1\}$ according to the rule $\sigma_B = {\rm sgn}(B^{(m)} \cdot S)$. If
$[B^{(m)}]_i = \Theta(i \in I_m)\, O(\sqrt{N/m}) + \Theta(i \notin I_m)\, o(\sqrt{m/N})$, where $I_m \subset \{1, 2, \ldots, N\}$ is a set of m
(odd) different indexes $1 \le i \le N$, we have a diluted binary perceptron. In our calculations
we will consider $[B^{(m)}]_i = \Theta(i \in I_m) \sqrt{N/m}$, where $m \ll N$ will be kept finite.
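As an illustration, a diluted binary teacher of this form can be built in a few lines. This is a sketch under stated assumptions: the helper names `diluted_teacher` and `classify` are hypothetical, and the m active indexes are drawn uniformly at random, which the text does not prescribe.

```python
import numpy as np


def diluted_teacher(n, m, rng):
    """Synaptic vector with m (odd) entries equal to sqrt(n/m) and zeros elsewhere."""
    if m % 2 == 0 or not 0 < m <= n:
        raise ValueError("m must be odd and satisfy 0 < m <= n")
    b = np.zeros(n)
    active = rng.choice(n, size=m, replace=False)  # the index set I_m (assumed random)
    b[active] = np.sqrt(n / m)
    return b


def classify(b, s):
    """Label sgn(B . S); never zero because m is odd and the entries of s are +-1."""
    return int(np.sign(b @ s))
```

With this normalization $|B|^2 = N$, and the label is a majority vote over the m active bits.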

For the binary perceptron $B^{(m)}$, the distance sensitivity component (1) in the large system
limit ($P < N \to \infty$ with $p = P/N < \infty$), $\partial^{(m)}(p)$, is given by

(m-l)/2 
As shown in Appendix A and following (author?) [20], the $\partial^{(m)}(p)$ are a family of
concave functions, ordered according to $\partial^{(m)}(p) < \partial^{(m+2)}(p)$ for $p \in (0, \frac{1}{2})$. Therefore, the order
given by the hardness measure coincides with the order given by m: the larger m is, the
harder the teacher.
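The ordering by m can be verified numerically for small dilutions. The sketch below assumes, as in Appendix A, that in the large system limit the number n of flipped relevant bits is binomially distributed with parameter p; the function names are illustrative, and the enumeration is exponential in m, so it is only meant for small m.

```python
from itertools import product
from math import comb


def flip_probability(m, n):
    """Fraction of the 2^m relevant-bit configurations whose majority vote
    changes sign when n of the m relevant bits are flipped."""
    changed = 0
    for bits in product((-1, 1), repeat=m):
        before = sum(bits)
        after = before - 2 * sum(bits[:n])  # by symmetry, flip the first n bits
        if (before > 0) != (after > 0):
            changed += 1
    return changed / 2 ** m


def diluted_sensitivity(m, p):
    """Sensitivity of an m-bit majority vote when each relevant bit flips with prob. p."""
    return sum(
        comb(m, n) * p ** n * (1 - p) ** (m - n) * flip_probability(m, n)
        for n in range(m + 1)
    )
```

For m = 1 this returns p, and for m = 3 it returns $\frac{3}{2}p(1-p) + p^3$, which lies above p on $(0, \frac{1}{2})$, consistent with the stated ordering.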

Another reason for using a diluted perceptron as a teacher is that the corresponding
optimal learning algorithm can be obtained analytically. In a supervised
on-line learning scenario, the synaptic vector of the student perceptron J is adjusted after
receiving new information in the form of the pair $(S, \sigma_B)$, following the rule

$$J_{\rm new} = J_{\rm old} + \frac{1}{\sqrt{N}}\, F\, \sigma_{B}^{\rm new}\, S_{\rm new}, \qquad (2)$$

where $J \in \mathbb{R}^N$, $\sigma_B = {\rm sgn}(B \cdot S)$ is the classification given to the example by the teacher B
and F is the learning amplitude or algorithm. The parameters of the problem are

$$h = \frac{J \cdot S}{|J|}, \qquad b = \frac{B \cdot S}{|B|}, \qquad Q = \frac{J \cdot J}{N}, \qquad R = \frac{B \cdot J}{|B||J|},$$

where h is known as the student's post-synaptic field, b is the teacher's post-synaptic field,
i.e. ${\rm sgn}(b) = \sigma_B$, Q is the normalized length of J and R is the overlap between teacher and
student.
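A single on-line update and the associated parameters can be sketched as follows. This is an illustrative sketch, not the authors' code: the names `order_parameters`, `online_step` and `hebb` are hypothetical, and the amplitude is passed as a function of the stability $\phi = \sigma_B h$, Q and R so that different algorithms can be plugged in.

```python
import numpy as np


def order_parameters(j, b, s):
    """Return (sigma_B, h, Q, R) for student j, teacher b and example s."""
    sigma_b = np.sign(b @ s)
    h = j @ s / np.linalg.norm(j)
    q = j @ j / j.size
    r = b @ j / (np.linalg.norm(b) * np.linalg.norm(j))
    return sigma_b, h, q, r


def online_step(j, b, s, amplitude):
    """One update J_new = J_old + F * sigma_B * S / sqrt(N)."""
    sigma_b, h, q, r = order_parameters(j, b, s)
    f = amplitude(sigma_b * h, q, r)  # F may depend on the stability, Q and R
    return j + f * sigma_b * s / np.sqrt(j.size)


def hebb(phi, q, r):
    """Hebb algorithm: constant amplitude F = 1."""
    return 1.0
```

Note that the teacher vector enters only through the label $\sigma_B$ and through the overlap R, which the amplitude is allowed to use.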

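Following (author?) we found that the equation of motion for the overlap R in terms
of the total number of examples received $\alpha N$, in the large size limit $N \to \infty$, is

$$\frac{dR}{d\alpha} = \left\langle \frac{F}{\sqrt{Q}} \left( \langle |b| \rangle_{\phi} - R\phi \right) - \frac{R}{2Q} F^2 \right\rangle_{\phi}, \qquad (3)$$

where $\langle \cdot \rangle_{\phi}$ represents an average over the distribution $\mathcal{P}(\phi)$ and $\phi = \sigma_B h$. The solution of
this equation represents the evolution of the overlap R as a function of the time $\alpha$.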

Remembering that the generalization error is defined as $e_g = \arccos(R)/\pi$, and that the
learning curve is the error as a function of $\alpha$, we define the residual error as the asymptotic
value of the learning curve at large values of $\alpha$, i.e. $e^* = \lim_{\alpha \to \infty} e_g(\alpha)$.
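These definitions translate directly into code. The sketch below is illustrative: `generalization_error` and `overlap` follow the formulas in the text, while `tail_average` is only a crude stand-in for the asymptotic value (an assumption of this sketch, not the extrapolation procedure used in the paper).

```python
import numpy as np


def overlap(j, b):
    """Teacher-student overlap R = B.J / (|B||J|)."""
    return float(b @ j / (np.linalg.norm(b) * np.linalg.norm(j)))


def generalization_error(r):
    """e_g = arccos(R) / pi, with R clipped to [-1, 1] against rounding errors."""
    return float(np.arccos(np.clip(r, -1.0, 1.0)) / np.pi)


def tail_average(learning_curve, tail=100):
    """Crude residual-error estimate: mean of the last `tail` points of a curve."""
    curve = np.asarray(learning_curve, dtype=float)
    return float(curve[-tail:].mean())
```

An uncorrelated student (R = 0) has $e_g = 1/2$, and perfect learning (R = 1) gives $e_g = 0$.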

By the application of a variational technique it is possible to obtain an expression for the
optimal algorithm $F_{\rm op}$. The optimal algorithm is the algorithm that produces the fastest
decaying learning curve and can be generically expressed as

$$F_{\rm op} = \frac{\sqrt{Q}}{R} \left[ \langle |b| \rangle_{\phi} - R\phi \right].$$
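As a brief worked step, here is a sketch assuming the equation of motion has the standard on-line form $dR/d\alpha = \langle \frac{F}{\sqrt{Q}}(\langle |b| \rangle_{\phi} - R\phi) - \frac{R}{2Q}F^2 \rangle_{\phi}$. Under that assumption, the optimal amplitude follows from maximizing the integrand pointwise in F, and substituting it back shows that the overlap cannot decrease:

```latex
\frac{\partial}{\partial F}\left[\frac{F}{\sqrt{Q}}\bigl(\langle |b| \rangle_{\phi} - R\phi\bigr)
  - \frac{R}{2Q}F^{2}\right] = 0
\;\Longrightarrow\;
F_{\mathrm{op}} = \frac{\sqrt{Q}}{R}\bigl(\langle |b| \rangle_{\phi} - R\phi\bigr),
\qquad
\left.\frac{dR}{d\alpha}\right|_{F_{\mathrm{op}}}
  = \frac{1}{2R}\Bigl\langle \bigl(\langle |b| \rangle_{\phi} - R\phi\bigr)^{2} \Bigr\rangle_{\phi} \ge 0 .
```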
II. ANALYTICAL RESULTS



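Following (author?) [22] we can prove (see Appendix B) that, for a perceptron with
dilution m,

$$\mathcal{P}(\phi|m) = \frac{1}{2^{m-1}} \sum_{k=0}^{(m-1)/2} \binom{m}{(m-1)/2 - k}\, \mathcal{N}(\phi | R\mu_k, 1 - R^2), \qquad (4)$$

$$\langle |b| \rangle_{\phi, m} = \frac{\sum_{k=0}^{(m-1)/2} \binom{m}{(m-1)/2 - k}\, \mu_k\, \mathcal{N}(\phi | R\mu_k, 1 - R^2)}{\sum_{k=0}^{(m-1)/2} \binom{m}{(m-1)/2 - k}\, \mathcal{N}(\phi | R\mu_k, 1 - R^2)}, \qquad (5)$$

$$F_{\rm op}^{(m)} = \frac{\sqrt{Q}}{R} \left[ \langle |b| \rangle_{\phi, m} - R\phi \right], \qquad (6)$$

where $\mu_k = (2k + 1)/\sqrt{m}$ and $\mathcal{N}(x|\mu, \sigma^2)$ is a Normal distribution centered at $\mu$ with
variance $\sigma^2$. Observe that (5) is needed for computing the evolution (3), and (6) represents
the optimal learning algorithm.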
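The quantities entering the optimal algorithm can be evaluated numerically. The sketch below assumes the Gaussian-mixture form derived in Appendix B, with components centered at $R\mu_k$, variance $1 - R^2$ and binomial weights; all function names are illustrative. The posterior weights are computed in log space so that the ratio stays finite when R is close to 1.

```python
from math import comb

import numpy as np


def _mixture(m):
    """Centers mu_k = (2k+1)/sqrt(m) and normalized binomial weights, k = 0..(m-1)/2."""
    k = np.arange((m + 1) // 2)
    mu = (2 * k + 1) / np.sqrt(m)
    w = np.array([comb(m, (m - 1) // 2 - int(i)) for i in k]) / 2 ** (m - 1)
    return mu, w


def stability_density(phi, r, m):
    """Density of the stability phi for a teacher with dilution m at overlap r < 1."""
    mu, w = _mixture(m)
    phi = np.atleast_1d(np.asarray(phi, dtype=float))[:, None]
    var = 1.0 - r ** 2
    dens = np.exp(-(phi - r * mu) ** 2 / (2.0 * var)) / np.sqrt(2.0 * np.pi * var)
    return (w * dens).sum(axis=1)


def mean_abs_field(phi, r, m):
    """Conditional expectation <|b|> given phi, as a posterior average of mu_k."""
    mu, w = _mixture(m)
    phi = np.atleast_1d(np.asarray(phi, dtype=float))[:, None]
    log_post = np.log(w) - (phi - r * mu) ** 2 / (2.0 * (1.0 - r ** 2))
    log_post -= log_post.max(axis=1, keepdims=True)  # avoid underflow
    post = np.exp(log_post)
    return (post * mu).sum(axis=1) / post.sum(axis=1)


def optimal_amplitude(phi, q, r, m):
    """F_op = sqrt(Q)/R * (<|b|> - R*phi) for a teacher with dilution m."""
    return np.sqrt(q) / r * (mean_abs_field(phi, r, m) - r * np.asarray(phi, dtype=float))
```

For m = 1 the teacher field is always $|b| = 1$, so the amplitude reduces to $\sqrt{Q}(1 - R\phi)/R$.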

Suppose that the teacher is characterized by a dilution $m_B$ and the student implements an
algorithm (6) for learning a teacher perceptron with dilution m. This is equivalent to having
prepared a student to learn optimally from $B^{(m)}$ and now exposing it to $B^{(m_B)} \neq B^{(m)}$. Let
us define the quantity

$$T(\phi|R, m) \equiv \langle |b| \rangle_{\phi, m} - R\phi. \qquad (7)$$

In this setting, the algorithm has the form $F_{\rm op}^{(m)} = \frac{\sqrt{Q}}{R} T(\phi|R, m)$ and the distribution of
$\phi$ is a function of $m_B$. The evolution of the overlap R is now given by equation (3),

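$$\frac{dR}{d\alpha} = \left\langle \frac{1}{R} T(\phi|R, m)\, T(\phi|R, m_B) - \frac{1}{2R} T^2(\phi|R, m) \right\rangle_{\phi|m_B},$$

which can be reduced to

$$\frac{dR^2}{d\alpha} = 2 \left\langle T(\phi|R, m)\, T(\phi|R, m_B) \right\rangle_{\phi|m_B} - \left\langle T^2(\phi|R, m) \right\rangle_{\phi|m_B} \qquad (8)$$
$$\phantom{\frac{dR^2}{d\alpha}} = \left\langle T^2(\phi|R, m_B) \right\rangle_{\phi|m_B} - \left\langle \left[ T(\phi|R, m_B) - T(\phi|R, m) \right]^2 \right\rangle_{\phi|m_B}. \qquad (9)$$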



The overlap R grows from zero to a stationary value, thus we expect the second term on
the RHS of (9) to be smaller than the first one. In the asymptotic regime ($\alpha \to \infty$) the
derivative is zero, implying that no further changes are expected in the overlap, and then
we have that

$$\left\langle T^2(\phi|R^*, m_B) \right\rangle_{\phi|m_B} = \left\langle \left[ T(\phi|R^*, m) - T(\phi|R^*, m_B) \right]^2 \right\rangle_{\phi|m_B}, \qquad (10)$$

where $R^* = \lim_{\alpha \to \infty} R(\alpha)$.

Figure 1: $T(\phi|R, m)$ (full curve) and the probability of the stability $\mathcal{P}(\phi|m)$ (dashed curve)
against $\phi$ in units of $1/\sqrt{m}$ for R = 1 (upper panel) and R = 0.99 (lower panel) for m = 9. Observe
that for R = 1 (upper panel) the average of the LHS of (10) involves only the points at which $T(\phi|1, m)$
is zero, whilst for R < 1 (lower panel) the same average requires a more intensive calculation.

Observe that if $m = m_B$, the second term of the RHS of (9) is zero, the algorithm applied
is optimal and the overlap reaches $R^* = 1$ with the smallest possible set of examples. If
perfect learning implies $R^* = 1$ it is natural to ask for what values of m the student can
learn a teacher with dilution $m_B$ without errors. From (4) and (5) we have that, for R = 1,

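$$\mathcal{P}(\phi|m_B) = \frac{1}{2^{m_B - 1}} \sum_{k=0}^{(m_B-1)/2} \binom{m_B}{(m_B - 1)/2 - k}\, \delta\!\left( \phi - \frac{2k+1}{\sqrt{m_B}} \right), \qquad (11)$$

$$T(\phi|1, m_B) = \frac{1}{\sqrt{m_B}} \left[ 1 + 2 \sum_{k=1}^{(m_B-1)/2} \Theta\!\left( \sqrt{m_B}\,\phi - 2k \right) \right] - \phi, \qquad (12)$$

$$T(\phi|1, m) = \frac{1}{\sqrt{m}} \left[ 1 + 2 \sum_{k=1}^{(m-1)/2} \Theta\!\left( \sqrt{m}\,\phi - 2k \right) \right] - \phi. \qquad (13)$$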



The LHS of (10), averaged over (11), is zero (see figure 1). This is due to the fact that
$T((2k+1)/\sqrt{m_B}\,|\,1, m_B) = 0$. Therefore, in order to satisfy (10) we also need that
$T((2k+1)/\sqrt{m_B}\,|\,1, m) = 0$. In particular, for k = 0 these two equations imply that

$$1 + 2 \sum_{k=1}^{(m-1)/2} \Theta\!\left( \sqrt{\frac{m}{m_B}} - 2k \right) = \sqrt{\frac{m}{m_B}}.$$

Therefore

$$\sqrt{\frac{m}{m_B}} = 2q + 1, \qquad (14)$$

where q is a suitable, non-negative integer. Thus, the condition for R = 1 to be a solution
of (10) is that there exists $q \in \mathbb{N} \cup \{0\}$ such that $m = (2q + 1)^2 m_B$.
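The condition for error-free learning, namely that the student's dilution equals an odd perfect square times the teacher's dilution, is easy to test programmatically. A minimal sketch follows; the function name `learns_without_error` is illustrative.

```python
from math import isqrt


def learns_without_error(m, m_b):
    """True when m = (2q + 1)^2 * m_b for some non-negative integer q."""
    if m <= 0 or m_b <= 0 or m % m_b != 0:
        return False
    ratio = m // m_b
    root = isqrt(ratio)
    # The ratio must be a perfect square with an odd square root.
    return root * root == ratio and root % 2 == 1
```

For a one-bit teacher this singles out m = 1, 9, 25, ..., which are the students whose error is reported to vanish in the numerical experiments.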

If this is not true, the solution of (10) is at $R^* < 1$. We will present an approach based
on the assumption that the root $R^*$ occurs in a regime where the Gaussian distributions
$\mathcal{N}(\phi | R^* \mu_k, 1 - R^{*2})$ in (4) and (5) have a small overlap. This could be ensured if the
separation of two adjacent Gaussian components were larger than two standard deviations,
i.e.

$$R^* |\mu_k - \mu_{k+1}| = \frac{2R^*}{\sqrt{m}} > 2\sqrt{1 - R^{*2}}, \qquad (15)$$

$$\frac{R^{*2}}{1 - R^{*2}} > m. \qquad (16)$$
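As a worked illustration (a sketch that only rearranges the small-overlap condition above and uses $e_g = \arccos(R)/\pi$), the condition can be restated as a lower bound on the overlap, or equivalently as an upper bound on the residual error for which the approximation is expected to hold:

```latex
\frac{R^{*2}}{1 - R^{*2}} > m
\;\Longleftrightarrow\;
R^{*} > \sqrt{\frac{m}{m+1}}
\;\Longleftrightarrow\;
e^{*} = \frac{1}{\pi}\arccos R^{*} < \frac{1}{\pi}\arctan\frac{1}{\sqrt{m}},
\qquad
\text{e.g. } m = 9:\; R^{*} > 0.9487,\;\; e^{*} < 0.1024 .
```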



At R = 1 the curve $T(\phi|1, m)$ is discontinuous at $\phi = 2k/\sqrt{m}$ and the probability $\mathcal{P}(\phi|m)$
is a linear combination of delta functions centered at $\phi = (2k + 1)/\sqrt{m}$ (upper panel of
figure 1). For R < 1 (figure 1, lower panel), $T(\phi|R, m)$ is continuous and $\mathcal{P}(\phi|m)$ is a linear
combination of Gaussian distributions centered at $\phi = R(2k + 1)/\sqrt{m}$ with variance $1 - R^2$.
In both cases $T(\phi|R, m)$ appears to be a periodic function of $\phi$ with period $\phi_T = 2R/\sqrt{m}$,
in the support of $\mathcal{P}(\phi|m) \subset \mathbb{R}$, i.e.

$$T(\phi|R, m) \simeq T(\phi + n\phi_T | R, m),$$

and particularly for R = 1 we have that

$$T(\phi|1, m) = \sum_{\ell=0}^{(m-1)/2} \Theta\!\left[ \left( 2(\ell + 1) - \sqrt{m}\,\phi \right) \left( \sqrt{m}\,\phi - 2\ell \right) \right] \left( \frac{2\ell + 1}{\sqrt{m}} - \phi \right). \qquad (17)$$



We can approximate $T(\phi|R, m)$ by a suitable superposition of Normal distributions. Consider
the superposition

$$\tilde{T}(\phi|R, m) = \int dr\, g(r)\, \mathcal{N}(\phi | r, 1 - R^2). \qquad (18)$$

To determine the function g(r), we perform a variational calculation to minimize the error
functional

$$E[g] = \frac{1}{2} \int d\phi \left[ T(\phi|R, m) - \tilde{T}(\phi|R, m) \right]^2.$$

Observe that the optimal function $g_0(r)$ is the solution of the equation $\delta E / \delta g = 0$, which
implies that for all $r_0 \in \mathbb{R}$ we have that

$$\int d\phi \left[ T(\phi|R, m) - \tilde{T}(\phi|R, m) \right] \mathcal{N}(\phi | r_0, 1 - R^2) = 0,$$

in particular if R = 1 (we assume that $g_0(r)$ is independent of R)

$$\int d\phi \left[ T(\phi|1, m) - \int dr\, g_0(r)\, \delta(\phi - r) \right] \delta(\phi - r_0) = 0,$$

$$g_0(r_0) = T(r_0|1, m).$$

Therefore

$$T(\phi|R, m) \simeq \int dr\, T(r|1, m)\, \mathcal{N}(\phi | r, 1 - R^2)$$



•Sis 
E /
R
.2 i-l Krn-L)/i

- dtt V Af{(j)\R{2i+l -t)/y^,l- R'^ 

£=0 



Let us define the integral

$$\mathcal{I}_{m_1, m_2} = \int d\phi\, \mathcal{P}(\phi|m_B)\, T(\phi|R, m_1)\, T(\phi|R, m_2).$$



Following the development of Appendix [0 we have that 



m\ ,1712 



l-R^ 
where 5'^ are given by 



(mB-l)/2 
B-l



A;=0 



me 



(mB-l)/2-A;/^'"^''='^'"^''^ 



2A; + 1 1 
m^ 2 2 
1 2k+l



(19) 



(20) 



(21) 



(22) 
where $\lfloor r \rceil$ is the closest integer to $r \in \mathbb{R}$.

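Observe that from (8) we have that in the asymptotic regime

$$0 = 2 \left\langle T(\phi|R^*, m_B)\, T(\phi|R^*, m) \right\rangle_{\phi|m_B} - \left\langle T^2(\phi|R^*, m) \right\rangle_{\phi|m_B} \simeq 2\, \mathcal{I}_{m_B, m} - \mathcal{I}_{m, m},$$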



and observing that J^me.m = 1 — -R^ (given that fe = V/c) and 

(mB-l)/2 



= 1 - -R^ + _i ( 

A:=0 ^ 



me - l)/2 - A; 



then 



.*2 



(mB-l)/2 



1 + 



2mB-l / ^ 



fc=0 



m,k 



{ruB - l)/2 - fc, 

and observe that S^i^ = iSm = {2q + l)^mB, g G N, which is consistent with (fT4l) . 



(23) 



III. NUMERICAL RESULTS 



Using (23) we plot $e^* = e_g(R^*)$ as a function of $\sqrt{m/m_B}$ (see figure 2).

To validate our result shown in (23) we ran a series of numerical experiments consisting
of a student learning from a teacher with only one bit ($m_B = 1$). The student updates its
synaptic vector following (2) using a learning algorithm given by (6) with m = 1, 3, ..., N.
To compute the generalization error we average over 50 realizations of the learning curve.
The maximum number of examples considered was 16 000. In figure 3 we present $e_g$
as a function of $\alpha^{1/4}$ for m = 1, 5, 9, 13, 25, 27 and network size N = 51. We have chosen
the exponent 1/4 to better show the curve features at short times and the approach to the
asymptotic regime. It is clear from the picture that for $m = 1^2, 3^2, 5^2$ the generalization
error for large $\alpha$ drops to zero as predicted. In order to extract the asymptotic behaviour of
the curves we applied the Bulirsch-Stoer algorithm (author?) [21].

In figure 4 we present the extrapolated values of the learning curves together with the
values estimated by the application of (23) as a function of $\sqrt{m}$. The error bars are estimates
obtained also by the application of the Bulirsch-Stoer algorithm.
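The structure of such an experiment can be sketched as follows. This is an illustrative sketch, not the authors' code: for brevity it uses the Hebb amplitude F = 1 as the default plug-in, a small random initial student, a teacher that reads the first $m_B$ inputs, and far fewer realizations and examples than the 50 runs and 16 000 examples quoted above. All function names are hypothetical.

```python
import numpy as np


def hebb(phi, q, r):
    """Reference amplitude F = 1 (any function of phi, Q and R can be plugged in)."""
    return 1.0


def learning_curve(n, m_b, n_examples, amplitude, rng):
    """Generalization error after each example for one realization."""
    b = np.zeros(n)
    b[:m_b] = np.sqrt(n / m_b)                   # teacher reads the first m_b bits (m_b odd)
    norm_b = np.linalg.norm(b)
    j = rng.standard_normal(n) / np.sqrt(n)      # assumed initial condition
    errors = np.empty(n_examples)
    for t in range(n_examples):
        s = rng.choice([-1.0, 1.0], size=n)
        sigma_b = np.sign(b @ s)
        norm_j = np.linalg.norm(j)
        r = b @ j / (norm_b * norm_j)
        q = norm_j ** 2 / n
        phi = sigma_b * (j @ s) / norm_j         # stability of the example
        j = j + amplitude(phi, q, r) * sigma_b * s / np.sqrt(n)
        r_new = b @ j / (norm_b * np.linalg.norm(j))
        errors[t] = np.arccos(np.clip(r_new, -1.0, 1.0)) / np.pi
    return errors


def averaged_curve(n, m_b, n_examples, amplitude, n_runs, seed=0):
    """Average the learning curve over independent realizations."""
    rng = np.random.default_rng(seed)
    runs = [learning_curve(n, m_b, n_examples, amplitude, rng) for _ in range(n_runs)]
    return np.mean(runs, axis=0)
```

Plotting the output of `averaged_curve(51, 1, 16000, hebb, n_runs=50)` against $(t/N)^{1/4}$ would give a reference curve of the kind described for the Hebb algorithm; replacing `hebb` with an implementation of the dilution-m amplitude would give the other curves.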
Figure 2: Generalization error in the asymptotic regime $e^*$ as a function of $\sqrt{m/m_B}$, for $m_B$ =
1, 3, 7. We have used (23) to compute the overlap $R^*$.



IV. CONCLUSIONS 



We studied the generalization capabilities of a student optimally adapted for learning
from a teacher B, when learning from a teacher B' ≠ B. We observed that, although the
algorithm the student uses may be suited for learning from a harder teacher (as defined
by Franco), that does not guarantee the success of the process, as revealed by (23). This
behavior is due to the extreme specialization implied by the algorithm (6). When this
algorithm (with parameter m) is applied to learn from a teacher with $m_B < m$, the student
tries to extract information from bits that the teacher does not use for producing the correct
classification. These interference effects produce mostly bad results, giving rise to a residual
error in the asymptotic regime. In this sense, the algorithm $F_{\rm op}^{(m)}$ is worse than the Hebb
algorithm $F_{\rm Hebb} = 1$.

Despite the discrepancies shown in figure 4, our estimate (23) reproduces faithfully the
qualitative behaviour observed in the simulations. There are two sources of uncertainty that
may account for the observed discrepancies: the (finite) size of the network used and a not
sufficiently large $\alpha$.

Figure 3: Generalization error as a function of $\alpha^{1/4}$, for a teacher with dilution $m_B = 1$ and students
with m = 1, 5, 9, 13, 25, 27, for a network with N = 51. The curves that correspond to the Hebb
algorithm (F = 1, long dashed) and $m = \infty$ (dot dashed) are presented as a reference.

From figure 2, the algorithm obtained by taking the limit $m \to \infty$ in (6),

$$F_{\rm op}^{(\infty)} = \frac{\sqrt{Q}}{R} \sqrt{\frac{1 - R^2}{2\pi}}\; \frac{\exp\!\left( -\frac{R^2 \phi^2}{2(1 - R^2)} \right)}{H\!\left( -\frac{R\phi}{\sqrt{1 - R^2}} \right)},$$

where $H(x) = \int_x^{\infty} dy\, e^{-y^2/2} / \sqrt{2\pi}$, as reported by (author?) [22], produces zero residual error
for all $m_B$. The Hebb algorithm $F_{\rm Hebb} = 1$ also produces learning curves with zero residual
error. In figure 3 we observe that the Hebb algorithm performs better than $F_{\rm op}^{(\infty)}$. This is
not a contradictory result. $F_{\rm op}^{(\infty)}$ is the algorithm that has the best average performance
considering a homogeneous distribution of teachers over the N-sphere. For a measure zero
subset of vectors embedded in the N-sphere, like the perceptrons with finite dilution m,
$F_{\rm op}^{(\infty)}$ could perform worse than the Hebb algorithm, as seems to be the case here.

Figure 4: Comparison of the asymptotic value of the generalization error using (23) and the
extrapolated values of the curves presented in figure 3.

In order to obtain the fastest decaying learning curve, a student has to infer the correct 
dilution of the teacher for choosing the appropriate learning algorithm. Developing an 
efficient technique for inferring the correct dilution parameter will be the subject of our 
future research. 
Appendix A: DISTANCE SENSITIVITY

S and S' are vectors that differ in exactly P bits, i.e. $\sum_{j=1}^{N} \Theta(-S_j S'_j) = P$. Taking S as
a reference, we can construct a Pth neighbor S' by choosing without replacement P indexes
from 1 to N and flipping the corresponding entries in S. There are $\binom{N}{P}$ different ways to
choose P indexes, each one creating a different set of indexes $I_P$. Introducing the scaled
variables $\mu = w \cdot S/\sqrt{N}$ and $\mu' = w \cdot S'/\sqrt{N}$ by means of Dirac delta functions and adding
up over all possible configurations S, we can express the discrepancy component as

\P J J~oo 27r 2n 

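The fraction of sets $I_P$ with $n \le m$ indexes $\ell \le m$ is $\binom{m}{n} \binom{N-m}{P-n} / \binom{N}{P}$, and observing that
in the limit $P < N \to \infty$ with $P/N = p < 1$ we have that

$$\lim_{P < N \to \infty} \binom{N}{P}^{-1} \binom{N-m}{P-n} = p^n (1 - p)^{m-n}.$$

From equation (A1) we have that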

271 2ti 



CO 

xX:(:)."(l-.r-"co=(^)cos(^) 



?i=0 

By adding up the sum, expanding the cosines and applying the identity $\Theta(ab) = \Theta(a)\Theta(b) +
\Theta(-a)\Theta(-b)$, the expression for the sensitivity reduces to

m / \ r „ -1 2 

c)(™)(p) = 2 5^ n (1 - 2pr / cos(/i/v^)™-" smifi/V^r , 

n=0 V""/ 

where the notation / V{^, fi) f{^, fi) stands for (27r)^^ Jg°° dyU d/te^*^'^ /(/i, /t). The in- 
tegrals to be solved are 



^ J V (r^, fi) cos(r/)- 
b"^ = j V {t], fj) cos(r})"-2" sin(7})2" 
C = [ P(r/,r})cos(r7)"*-(2"+i)sin(r))2"+i. 
Before computing the integrals observe that for all A > 0 and B > 0



V [f], fj) smiyAfj) 



OO /"OO 



dry / dryexp {—ifjr]) [exp (ifjA) — exp (—irjA)] 

J —OO 

POD 

dr] / dfj [exp [—ifi{r] — A)] — exp [—ir]{ri + A)]] 



47r Jo 



-- [OiA) - 0i-A)] 
i 

~2' 



(A2) 



similarly 



J V (77, 77) cos{Afi) cos{Bfj) = ^ [6>(A + S) + 0{-A - B) + e{A - B) + 0{-A + B)] 



1 
2 



(A3) 



and 



J V (77, 77) cos(A?7) sin(S?7) = ~ [e{A + B) - 0{A - B) + 0{-A + B) - 0{-A - B)] 

(A4) 



= --0{B-A). 
The first integral is (remember that m is odd) 

(m-l)/2 



VMcosifir^—^ r) V(rj,fi)cos[(m-2k)fi] 



(m-l)/2 



= - E 



fc=0 

The second integral is 



1 

2' 



(A5) 



cos(^)'"-''^sin(f7)2" 



^ (m-l)/2-n . 

I — n \ 



fe=0 



m — 2n 
k 



cos[(m — 2{k + n 



j=0 
(m-l)/2-n 



cos[(2n — 2j)r]] + 



= — y 

Om— 1 / -I 



1 2 



A;=0 
m— 2n 



2m- 1 2 

0. 



m — 2n 
k 

1 /2n 
'2 V n 



n-l 



+ 



i=o 

1 /2n 



2n 



+ 



2n 
n 

1 /2n 



22n+l V ^ 



22n+l 



(A6) 
And the last integral is then



V {f], fj) cos(r})"^-(2"+i) sin(r/)2"+i 

V [t], fi) cos(r})'"-(2"+i) [1 - cos\fi)] " sin(r/) 

lv{v,v)Y.}-^y f^l cos(r))-(^"+^)+^^sin(r)) 

n 



n\ {-i) ( 1 f m - 1 + 2{i - n) 



i=0 



£j 2'"-W-") \2\{m-l)/2 + i 



n 



(m-l)/2+^-n-l 



+ 



E 

fc=0 



m - l + 2(£-n) 
k 



0[l-{m-l + 2{i-n)-k)] 



£=0 



i /2r7,\ / m — l—2n 
~ \n) V(m - l)/2 - n 

We have that, for all m odd 



n\ fm- 1 + 2(£-n)) 
i) V(m- l)/2 + £-n 

(m- 1)/2V^ 



(A7) 



{m-l)/2 



2n+l 



2 2 



n=0 



where 



1 



2n\ / m-l-2n \ /(m-l)/2 
n J \{m — l)/2 — n) \ n 



-1 



(A8) 



(A9) 



4™-! + 1, 

Observe that $\partial^{(m)}(p)$ is concave in $p \in [0, \frac{1}{2}]$ (it is simply a sum of affine plus concave
functions) and $\partial^{(m)}(p) < \partial^{(m+2)}(p)$ for all $p \in (0, \frac{1}{2})$. To demonstrate the latter we use that

t)("^)(0) = and > Vs. Thus, from (lA8| at p = we have that 



Therefore 



a 



(m-l)/2 
s=0 



m+2 
(m+l)/2 



1 Vm > 1. 



(m-l)/2 



(AlO) 



simply by applying (A10) to m and to m + 2. Observe that $(1 - 2p)^n < (1 - 2p)^{n'}$ for all
$n > n'$ and $p \in (0, \frac{1}{2})$, thus



(m-l)/2 



a 



m+2 
(m+l)/2 



(i-2pr+^ < 5^ ( 



„m m+2 



) (1 - 2p) 



2s+l 
(m+l)/2 (m-l)/2

(1 - 2pr^' < 5^ ar (1 - ^pr^' 

s=0 s=0 

and thus $\partial^{(m)}(p) < \partial^{(m+2)}(p)$.

In the large m limit we have that

$$\lim_{m \to \infty} \partial^{(m)}(p) = \frac{1}{\pi} \arccos(1 - 2p),$$

which is the expected result (author?) [2
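The large-m limit can be checked with a small Monte Carlo experiment. The sketch below makes two assumptions that are not in the text: it takes a majority vote over all N inputs (N odd) as a stand-in for a perceptron with large dilution, and it estimates the discrepancy by sampling random input pairs at a fixed Hamming distance. The function name is illustrative.

```python
import numpy as np


def majority_sensitivity_mc(n, n_flip, n_samples, rng):
    """Monte Carlo estimate of the probability that a majority vote over n
    (odd) inputs changes when n_flip randomly chosen inputs are flipped."""
    disagreements = 0
    for _ in range(n_samples):
        s = rng.choice([-1.0, 1.0], size=n)
        s_prime = s.copy()
        flipped = rng.choice(n, size=n_flip, replace=False)
        s_prime[flipped] *= -1.0
        if np.sign(s.sum()) != np.sign(s_prime.sum()):
            disagreements += 1
    return disagreements / n_samples


def limiting_sensitivity(p):
    """Analytical large-m limit (1/pi) * arccos(1 - 2p)."""
    return float(np.arccos(1.0 - 2.0 * p) / np.pi)
```

For p close to 1/4 the limiting value is close to 1/3, and the sampled estimate should agree with it up to statistical and finite-size errors.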



Appendix B: OPTIMAL LEARNING ALGORITHM 



The basic ingredient to compute the optimal learning algorithm is the joint probability
distribution of the variables $\sigma_B$, h and b. Given that $\mathcal{P}(\sigma_B, h, b|m) = \Theta(\sigma_B b)\, \mathcal{P}(h, b|m)$ we
will start our inference task by computing the distribution of the post-synaptic fields.

V{h,b\m) = J-S/|J|) (5(6-B-S/|B|))s 



-00 2vr 2tt \ ^ V |J| |B| / / s 



and assuming that [B]j = ^0{m + l—j) we can suppose that the student learns this rule 
in such a way that [3]j ^ J'0{m + 1 — j) +ej, where Ej <^ |J| are i.i.d. variables. Therefore 

J R e 



|J| |J| 



where R is the teacher-student overlap and e = Xl^i^i/'^- Let us define the variables 
with the properties of Yl^=i = ^^^d 



i=i 



Thus the trace over the spin variables gives 



,.^J-S .J.B-S 
exp ( + * "igT 



AT 



exp 



, MB] 



|B| 



j>m 
Y\ cos



hR + b 



m 



+ hipj 1 Y\ cos 



COS 



hR + b 



m 



1 — hipj tan 



cos 



exp 
hR + b 



n 

" 9 2^ LT|2 



|J| 

hR + b 



jr>m 



exp 



J 

1 



\j>m 



Therefore, and using that m is odd, 

(m-l)/2 



1 i.' "--^; /^ / \ r 
k=0 ^ ^ 



Ah d6 
27r 27r 

(m-l)/2 



exp 



h — ihh — z66 cos 



(m - 2A;) 



6 + m 



m 



M{h\RbA-R^)- y: 

k=0 ^ 



m 

(m- l)/2- A; 



[S {b - fXk) + S (b + fXk)] 



(Bl) 



where $\mu_k = (2k + 1)/\sqrt{m}$ and $\mathcal{N}(x|\mu, \sigma^2)$ is a Normal distribution in x, centered at $\mu$, with
variance $\sigma^2$.

From (B1) we can compute the joint distribution of the variables h and $\sigma_B$

(m-l)/2 



- l)/2-k 



, , , 1 \ ^ / r 

Via-B, him) = — y / 

fc=0 

which implies that 

/oo 
dhV{aB, h\m) 5{(t) - ash) 
-oo 



^{aBh\RfikA-R^), (B2) 



crB=±l ■ 



(m-l)/2 



— r 

m— 1 / -I 



k=0 



m 

(m- l)/2-k 



Ar{<f)\Rfik,l-R^). 



(B3) 



The conditional probability of the field b given ctb and can be obtained from (IBip and 

It is a simple inference exercise to find the conditional distribution of the field b given 
the stability (p 



V{b\^,m) 



2 e£o'^/'U-iT/2-J-^('^I^/^^>i-^^) 
The conditional expectation of the field |b| is



db\b\V{b\(l)) 



i?2 



(B4) 



Appendix C: DERIVATION OF (l2B 



In this Appendix we continue the development of ( !20f l 



mi,m2 



^V{<j)\m^)T{<j)\R,rm)T{<j)\R, m^) 



2™B 



^ (mB-l)/2 
A;=0 



TUB 

niB - l)/2 - k 
R 



(mi--l)/2 (m2-l)/2 
^1=0 ^2=0 



dtiti / dt2t2 



'mi 



(2£i + 1 - ti) AT 



:(2£2 + l-t2) 



where all the Normal distributions have exactly the same variance $1 - R^2$. The integral over
$\phi$ is simple and produces a bi-variate Gaussian distribution in $t_1$ and $t_2$



(mB-l)/2 



2niB 



— r 

B-l 



A:=0 



me 

(me - l)/2 - A; 



(»ni-l)/2 (m2-l)/2 



/ dttit2A/'(t|t^,,^2,fc;S; 

J sit 



where t = (ti,t2)"^, = (-1, 1) x (-1, 1), t^i/2,fc = (v^^^i.fc' v^'^^.^)"^ 

2£j + 1 2fc + 1 



nil |i/mim2 



i?2 



(CI) 

(C2) 
(C3) 



ii/mim2 m2 

From (15) all the entries of the covariance matrix (C3) are small, therefore all the
distributions are concentrated around $t_{\ell_1, \ell_2, k}$. Let $t^*$ be the vector that corresponds to the largest
term in (C1). Its components are



rrij 2 k 



rriB 



m 



m-B 



'^{2k+l), 



(C4) 
(C5) 
where |[r]| is the closest integer to r ∈ ℝ, thus y/rfijS*



-1,1). All the other vec- 



tors can be expressed as t^^ ^j^^ 
1, 



+ 2n, where n = (ni, 71,2)"'", nj = —I 



+ 



-1,1, 



+ {rrij - l)/2. We have that 

(mB-l)/2 



1712 



2ms 



— y 

B-1 



fc=0 



me 



(r/iB — l)/2 — A;/ y/rn^rn^ 



\[ dttit2A/'(t|t*;S) + V / dttit2Ar(t|t* + 2n;S) I . 

Observe that the vectors are always strictly inside the domain They can never be 
in the boundary of the domain given that rrij is odd then the argument of the RHS of 
(1C4I1 is never in Z1/2 (which would produce the largest possible value of ^/nfJ6^. |^). Thus, 
the largest contribution to the sum over n is of O [exp (— e^/ max {A G spec(S)})] , where 



e ~ 1 - \^s:^^^k\ 



> \ and 



max {A G spec(S)} 



mi + m2 + \l ml + rri^ — mim2 



<1, 



according to ( fTSll . Within the same approximation error we can suppose that the centre of 
the zero-th Normal distribution is located inside the domain and sufficiently farther from 
the boundary. Thus 

(mB-l)/2 / \ R2 /• 

/ dttit2Ar(t|t*-S) + 

2mB-i ^ V (me - l)/2 - kj ^/m^ U 1211,,; 



mi,m2 



k=0 



+0 exp 



max {A G spec(S)} 



Thus 



(mB-l)/2 



— T 



fe=0 



rriB 



dU UAf t 



m-Q — l)/2 — kJ y/mim2 

1 



l-i?2 



f dtiti7V(ti 

.7 —00 V 

2mB-l / ^ 



1 3 l-i?2 



i?2 



(mB-l)/2 



A:=0 



[niB - l)/2 - A; 



A* A* 